A Meta-Analytic Assessment of Empirical Differences
in Standard Setting Procedures

In testing, setting performance standards involves identifying cut scores that divide examinees 
into groups such as pass/fail, master/non-master, or certify/deny certification. Performance standards are 
used to make very important decisions in education and the job market. Standard-setting methods are also 
used to classify test takers into multiple levels of performance. A simple example is assigning grades of A, 
B, C, D, or F to examinees. Standards decide whether people are competent enough to work as teachers, 
school administrators, nurses, dentists, doctors, or other types of professionals. Standards also determine 
whether students are proficient enough to graduate, enter educational institutions, or be placed in certain 
classrooms. 

In setting a standard, there are many methods to choose from, all of which have been attacked and 
defended from both a theoretical and empirical perspective (see Reference section). Many empirical 
studies claim that different standard setting procedures yield different cut scores. Jaeger (1989) 
summarized this research by looking at the results of 12 different studies. These twelve studies compared
the cut score set by one method with the cut score set by another method. Within these studies, multiple
standard setting procedures were conducted on each of 32 different examinations. Jaeger calculated the 
ratio of the highest/lowest cut score and the highest/lowest expected failure rate for each examination. 
When analyzed this way, the results indicate that the different methods do produce different cut scores.
The median ratio of the cut scores was approximately 1½, indicating that one procedure was 1½ times as
stringent as another.
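
As a minimal sketch of the ratio statistic described above, the highest-to-lowest cut score ratio for a single examination could be computed as follows. The method names and cut scores are hypothetical, chosen only for illustration, and are not taken from Jaeger (1989).

```python
def stringency_ratio(cut_scores):
    """Ratio of the highest to the lowest cut score for one examination.

    cut_scores maps a method name to the cut score that method produced.
    """
    values = list(cut_scores.values())
    if len(values) < 2:
        raise ValueError("need cut scores from at least two methods")
    if min(values) <= 0:
        raise ValueError("cut scores must be positive")
    return max(values) / min(values)


# Hypothetical number-correct cut scores for two examinations.
exam_a = {"Angoff": 30.0, "Nedelsky": 20.0}
exam_b = {"Angoff": 20.0, "Nedelsky": 30.0}

ratio_a = stringency_ratio(exam_a)
ratio_b = stringency_ratio(exam_b)
```

In this sketch both examinations yield the same ratio even though the more stringent method differs between them, because the ratio records only the size of the gap and not its direction.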

Although Jaeger’s findings are interesting, they are not comprehensive. Jaeger acknowledged that
there was a great deal of variation in the ratios. This may be attributable to the nature of the ratios
themselves. In some of the 32 contrasts, the Angoff standard (1971) may have been the most stringent,
making it the numerator, while in others, it may have been the least stringent, making it the denominator.
Furthermore, the ratios seldom compared the difference in cut score of the same two methods. Sometimes,
a Nedelsky cut score (1954) may have been compared to an Angoff cut score, while other times, an Ebel (1972)
cut score may have been compared to a Contrasting Groups cut score (Livingston & Zieky, 1982).

Using meta-analysis, this research takes a deeper look at the studies in Jaeger’s research by
comparing cut scores derived by the Nedelsky (1954), Ebel (1972), Angoff (1971) in all of its modified
versions, Jaeger (1982), and the Borderline/Contrasting Groups methods (Livingston & Zieky, 1982). This
meta-analysis also looks beyond the articles in the Jaeger study to the entire literature base on standard
setting procedures, and infers that different standard-setting procedures do not systematically yield
different cut scores. This result is important because it provides validation for choosing a standard setting
method less for its statistical and theoretical properties and more for its ease of implementation. Indeed, if
the decision to use a certain method can be based on issues of implementation, with assurance that the
choice of method will not systematically influence the cut score produced, testing organizations can be
more efficient and productive in their test development and maintenance.



Method 



Data Collection 

Studies were collected from many sources within the published professional literature and papers 
presented at the annual meeting of the American Educational Research Association. Collection methods 
were designed to be comprehensive enough to represent the current state of empirical research conducted 
in the area of standard setting. It is noted that there is a high degree of overlap between studies used in this 
analysis and Jaeger’s (1989). However, the studies included in this analysis were collected from a 
completely independent literature search. In order for an article to be used, it had to provide a comparison 
of at least two types of standard setting methods by stating the cut score that each method rendered as well 
as a measure of the variance or error. The data allowed for over ninety comparisons from ten different 
articles. All standards were produced for multiple choice tests which varied in content, age of examinees, 
importance, and length. The exact standard-setting procedures may have differed in how they were
executed in each study. The procedures varied in how much judgment was exercised, what types of normative
information were provided to the judges, and how the groups of judges were divided. Nonetheless, each
cut score was classified by its theoretical underpinnings, e.g., all of the modified-Angoff procedures were
grouped together. Among the studies collected, the group of modified-Angoff procedures was the most
frequently encountered. The following section describes the articles from which data were used.

Behuniak, Archambault, and Gable (1982) compared the standards set by content specialists when 
they applied the Angoff and Nedelsky procedures to the Connecticut School System Test for reading and 
mathematics. These specialists were split into 8 parallel groups of 3-4 judges. Each group executed one of
the standard setting procedures by making judgments on 30 items of either the reading or math test.

Brennan and Lockwood (1980) used generalizability theory to “characterize and quantify” the
expected variance in cut scores resulting from the Nedelsky and Angoff procedures. A group of 5 judges
ran through both the Angoff and Nedelsky standard setting procedures for a 126-item test in a
“health-related” subject area.

Cross, Impara, Frary, and Jaeger (1984) compared the standards set by the Angoff, Jaeger, and
Nedelsky methods for a national teaching examination focusing on mathematics and elementary education.
The elementary education examination consisted of 150 items, while the mathematics examination
consisted of 120 items. For each test, 15 judges were divided into 3 panels of 5 judges. Each panel
conducted one of the standard setting procedures in three iterative sessions using different portions of the
test and different normative feedback information.

In a study involving multiple elementary schools, Livingston and Zieky (1989) compared the
standards set on the ETS Basic Skills Assessments tests for reading and math. In eight different middle
schools, two groups of judges performed three standard setting procedures: the Contrasting Groups method,
the Borderline Group method, and either the Nedelsky or Angoff method. There were 3-5 judges in each group.
In each middle school, one group reviewed the math test, while the other group judged reading.

Mills (1983) compared the standards set by the Angoff, Contrasting Groups, and the Borderline
Group methods on Louisiana’s grade basic skill tests (having between 30 and 60 items each). Six different
overlapping test forms for both the language arts and math sections were reviewed by two groups of judges.
Sixteen judges reviewed all 6 forms of the math examination, while 15 judges reviewed all 6 forms of the
language arts examination.

Mills and Melican (1988) also compared the standards set for the elementary education and
mathematics sections of the National Teachers Examination. In this study, four groups of judges were
formed. Each group performed one method, either the Angoff or the Nedelsky, for one section of the NTE.

Smith and Smith (1988) compared the standards set by the Angoff and Nedelsky procedures.
Working with the 64-item high school reading competency test in New Jersey, 31 judges performed one of
the two standard setting procedures. These judges were randomly assigned to a procedure, 16 in one
group, 15 in the other.

Three different standard setting procedures, Ebel, Nedelsky, and Angoff, were used to set
standards on the Missouri College English Test (Halpin and Halpin, 1987; Halpin, Sigmon, and Halpin,
1983). Three non-parallel groups of judges, 5 graduate students, 5 high school teachers, and 5 university
faculty, executed all three procedures by looking at all 90 items of the test.

Baron, Rindone, and Prowda (1981) also contrasted the cut scores set for Connecticut’s basic skill
tests for reading and mathematics. The Angoff, Nedelsky, Contrasting Groups, and Borderline Group
methods were employed. For the first two methods, four groups of approximately 10 judges each evaluated
one section of the examination using either the Angoff or Nedelsky method. For the latter two methods,
teachers at over 200 schools were asked to evaluate a group of 30 students selected by the principal at
random.

In evaluating the Kansas Competency Tests, Poggio, Glasnapp, and Eros (1981) employed the
Ebel, Angoff, and Nedelsky methods. In this study, cut scores were produced for ten different
examinations, five reading and five math, for grades 2, 4, 6, 8, and 11. For each test, three parallel groups of
approximately 25 judges evaluated the examination using one of the standard setting methods. For a
synopsis of all standard setting procedures, number of judges involved, test content, and number of effect
size estimates obtained from each study, see Table 1.

Computation of Effect Size 

In order to assess the difference in cut scores produced by each standard setting method, a
common metric was employed for every cut-score comparison in the data. The standardized magnitude of
the difference between two compared cut scores, called the effect size, was calculated. Due to the
dominant use of the Angoff procedure (Cizek, 1996; Plake, 1998), this method was treated as the control
group in effect size calculations. In viewing this study as a comparison of the Angoff procedure with the
other procedures, it is appropriate to calculate effect sizes using Glass’s Δ (Glass, McGaw, & Smith, 1981).

$$\Delta = \frac{C_i - C_A}{S_A} \qquad (1)$$

where $C_A$ = cut score set by a modified-Angoff procedure, $C_i$ = cut score set by an alternative procedure,
and $S_A$ = standard deviation of the modified-Angoff cut score. The variance for Glass’s Δ was calculated
by:

$$\mathrm{Var}(\Delta) = \frac{n_A + n_i}{n_A n_i} + \frac{\Delta^2}{2(n_A - 1)} \qquad (2)$$

where $n_A$ = number of judges who set cut scores using a modified-Angoff procedure, and $n_i$ = number of
judges who set cut scores via an alternative method.
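
The effect size computation can be sketched in a few lines, assuming the forms Δ = (C_i - C_A) / S_A and Var(Δ) = (n_A + n_i) / (n_A n_i) + Δ² / (2(n_A - 1)), with the modified-Angoff panel as the control group. The function names and the input values are hypothetical and serve only as an illustration; they are not data from any of the studies reviewed.

```python
import math


def glass_delta(cut_alt, cut_angoff, sd_angoff):
    """Standardized difference: (alternative - Angoff) / Angoff SD."""
    if sd_angoff <= 0:
        raise ValueError("sd_angoff must be positive")
    return (cut_alt - cut_angoff) / sd_angoff


def glass_delta_variance(delta, n_angoff, n_alt):
    """Sampling variance of Glass's delta with Angoff as the control group."""
    if n_angoff < 2 or n_alt < 1:
        raise ValueError("need n_angoff >= 2 and n_alt >= 1")
    return (n_angoff + n_alt) / (n_angoff * n_alt) + delta ** 2 / (2 * (n_angoff - 1))


# Hypothetical values, for illustration only.
delta = glass_delta(cut_alt=22.0, cut_angoff=25.0, sd_angoff=4.0)
variance = glass_delta_variance(delta, n_angoff=5, n_alt=5)
standard_error = math.sqrt(variance)
```

Under this sign convention a negative Δ means the alternative procedure set a lower cut score than the modified-Angoff procedure. With panels as small as five judges, the first term of the variance dominates, which is why small-panel comparisons carry little weight.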

Statistical Analyses of Effect Sizes

The mean effect size was used to determine if the group of cut-score comparisons was 
significantly different from zero. In order to ascertain if there was a significant difference in the effect size 
measures across methods, the effect sizes were analyzed using fixed and random effects one-way ANOVA 
models. 
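
A test of whether a mean effect size differs from zero can be sketched as below. The text does not spell out the weighting scheme, so inverse-variance weights are assumed here, following common fixed effects practice; the function name and the example values are hypothetical.

```python
import math


def fixed_effects_mean(effects, variances):
    """Inverse-variance weighted mean effect size, its standard error, and z."""
    if not effects or len(effects) != len(variances):
        raise ValueError("effects and variances must be non-empty and equal length")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return mean, se, mean / se


# Hypothetical effect sizes and sampling variances, for illustration only.
effects = [0.5, -0.5, 0.0]
variances = [0.25, 0.25, 0.5]
mean, se, z = fixed_effects_mean(effects, variances)
```

The resulting z is compared with a standard normal critical value (1.96 for a two-sided test at the .05 level). In this hypothetical example the positive and negative differences cancel, so the mean is zero even though the individual comparisons disagree.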

Effect sizes were grouped to produce a one-factor model of five different comparisons: 
Borderline/modified-Angoff, Contrasting Groups/modified-Angoff, Ebel/modified-Angoff, 
Jaeger/modified-Angoff, and Nedelsky/modified-Angoff. The rationale for this separation was to see if the
five non-modified-Angoff methods produced an effect when separated from the rest.

The Q statistic (Hedges, 1994) was used to assess the model assumption of homogeneity of
variance. In the one-factor model, the Q statistic takes the following form:

$$Q_{\text{between}} = \sum_{i=1}^{k} w_{i\cdot}\,(\bar{\Delta}_{i\cdot} - \bar{\Delta}_{\cdot\cdot})^2 \qquad (4)$$

$$Q_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{m_i} w_{ij}\,(\Delta_{ij} - \bar{\Delta}_{i\cdot})^2 \qquad (5)$$

$$Q = Q_{\text{between}} + Q_{\text{within}} \qquad (6)$$
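
The partition of Q into between-group and within-group parts can be sketched as follows. The sketch assumes that each weight is the reciprocal of the effect size's sampling variance, that a group's weight is the sum of its members' weights, and that group and grand means are weighted means; these definitions are assumptions based on common practice and are not stated explicitly in the text. The group labels and values are hypothetical.

```python
def q_partition(groups):
    """Partition the homogeneity statistic Q into between and within parts.

    groups maps a group label to a list of (effect_size, variance) pairs.
    Returns (q_between, q_within, q_total).
    """
    group_weight = {}
    group_mean = {}
    q_within = 0.0
    for label, pairs in groups.items():
        weights = [1.0 / v for _, v in pairs]
        w_sum = sum(weights)
        mean = sum(w * d for w, (d, _) in zip(weights, pairs)) / w_sum
        group_weight[label] = w_sum
        group_mean[label] = mean
        # Weighted squared deviations of each effect size from its group mean.
        q_within += sum(w * (d - mean) ** 2 for w, (d, _) in zip(weights, pairs))
    grand_weight = sum(group_weight.values())
    grand_mean = sum(group_weight[g] * group_mean[g] for g in groups) / grand_weight
    # Weighted squared deviations of each group mean from the grand mean.
    q_between = sum(group_weight[g] * (group_mean[g] - grand_mean) ** 2 for g in groups)
    return q_between, q_within, q_between + q_within


# Hypothetical comparisons, for illustration only.
groups = {
    "Nedelsky": [(-1.0, 0.5), (-0.5, 0.5)],
    "Ebel": [(0.5, 0.5), (1.0, 0.5)],
}
q_between, q_within, q_total = q_partition(groups)
```

Each component is referred to a chi-square distribution: the between-group part has one fewer degree of freedom than the number of groups, and the within-group part has the number of comparisons minus the number of groups.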



In the random effects model the variance component (between-studies variance) was calculated as
follows:

$$\hat{\sigma}_{\theta}^{2} = \frac{Q - (k - 1)}{\sum_{i=1}^{k} w_i - \sum_{i=1}^{k} w_i^2 \Big/ \sum_{i=1}^{k} w_i} \qquad (7)$$
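
A method-of-moments estimate of the between-studies variance can be sketched as below, assuming the form [Q - (k - 1)] / [Σw - Σw² / Σw] with inverse-variance weights and k effect sizes. Truncating a negative estimate at zero is an added assumption of this sketch, and the function name and example values are hypothetical.

```python
def variance_component(effects, variances):
    """Method-of-moments estimate of the between-studies variance.

    Uses [Q - (k - 1)] / [sum(w) - sum(w^2) / sum(w)] with w = 1 / variance.
    A negative estimate is truncated at zero (an assumption of this sketch).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / w_sum
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    c = w_sum - sum(w * w for w in weights) / w_sum
    return max(0.0, (q - (k - 1)) / c)


# Hypothetical effect sizes, for illustration only.
tau2_spread = variance_component([-1.0, 0.0, 1.0], [0.25, 0.25, 0.25])
tau2_equal = variance_component([0.1, 0.1, 0.1], [0.25, 0.25, 0.25])
```

When the effect sizes agree closely, Q falls below its degrees of freedom and the estimate is zero, so the random effects analysis reduces to the fixed effects one.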



Results 



Fixed effects model 

The overall standardized mean effect size difference was not significantly different from 0
(Δ = -0.02, z = -0.45, p > .05). For a graphical representation of all effect size estimates and effect size
estimates by method for the fixed effects model, see Figure 1. The fixed effects model indicated that the
Borderline (Z = 10.31) and the Jaeger (Z = 3.72) methods produced significantly higher cut scores than the
modified-Angoff methods (p < .05), while the Nedelsky method produced significantly lower cut scores
(Z = -14.02) (see Figure 2). As expected, the variance was heterogeneous (Q = 676, df = 91), both between
(Qbetween = 329, df = 4) and within (Qwithin = 347, df = 87) the five groups (p < .05). These results show
that there may be differences in the cut scores set by different types of standard setting procedures.
However, because of the heterogeneity of the variance within groups, caution should be exercised in accepting
the results derived from this model without further exploration.

One method used to control for the heterogeneity of effect sizes is to use a criterion to partition the
data. For this analysis the number of judges used in the standard setting procedure was used as the
criterion to control for heterogeneity. Standard setting often involves a small group of judges. The studies
gathered for this research were typical in this respect. The average Angoff group had 14 members, and the
standard deviation was 10. The nature of these small sample sizes and their great variability had a large
impact on the results. The effect size estimates from those studies with larger panels had less variance and
were more influential in the results (see Table 3).

Therefore, the studies were separated into two groups, those studies with Angoff panels of fewer
than 15 judges and those with 15 or more judges. When arranged this way, the mean effect sizes were not
significantly different from zero (Δ = -0.034 and Δ = -0.019, p > .05). Furthermore, there was no
difference between these two groups (Qbetween = .01), but there was a significant amount of variability
within groups (Qwithin small = 189, Qwithin large = 488).

Further attempts at reduction of within group variation, such as a two-factor model of size by
method, were not computed because there were too many empty cells in the matrix. A simpler approach
would have been to compare the large judge panel group to the small judge panel group within method, but
this was impossible due to the confounding “study” effects (see Table 4). We did, however, run an
analysis using only the small panels of judges. In this analysis (Qbetween = 28, Qwithin = 161), the within
cell variance was greatly reduced. Despite the reduction of heterogeneity achieved through the
identification and separation of studies based on the number of judges, the amount of within group
variance was higher than recommended for continued use of a fixed effects model.

Random effects model 

Although some of the heterogeneity found in the fixed effects model may have been due to randomness
resulting from sampling variability, it was also most likely due to some uncertainty involved in the
standard setting process. Regardless of the conceptual position chosen, the potential moderator
variables were too numerous to identify and account for with the small number of studies available for this
analysis. For this reason it was reasonable to apply a random effects model to the data. When all data
points were weighted equally, the mean effect size was not significant (Δ = 0.19, z = 0.17, p > .05). The
one-factor model also revealed that for all but the Jaeger comparison, the mean effect size was not
significantly different from 0 (see Table 2 for all random effects statistics). It should
be noted that the variance component (between-studies variance) of the Jaeger comparisons was not
calculated because the Q statistic was not statistically significant (Shadish and Haddock, 1994).

In assessing the adequacy of the random effects model, it was necessary to investigate the pattern
of variance for each of the effect size estimates. As seen in Figure 3, the introduction of the additional
variance component changed the relationship between the effect size estimates. All five of the confidence
intervals overlapped, whereas for the fixed effects model, only two overlapped. This pattern indicated that
for the effect size estimates there was no significant difference between standard setting methods. However,
the variance of these estimates indicated that there was at least as much variation within the standard
setting methods analyzed as there was between the methods.
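
The widening of the confidence intervals under the random effects model can be illustrated with a short sketch. It assumes that each random effects weight is the reciprocal of the sampling variance plus the between-studies variance component, and that a normal-theory 95% interval is used; the function name and all values are hypothetical.

```python
import math


def random_effects_summary(effects, variances, tau2, z_crit=1.96):
    """Weighted mean and confidence interval under a random effects model.

    Each weight is 1 / (sampling variance + tau2), where tau2 is the
    between-studies variance component. tau2 = 0 gives the fixed effects result.
    """
    if tau2 < 0:
        raise ValueError("tau2 must be non-negative")
    weights = [1.0 / (v + tau2) for v in variances]
    w_sum = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    return mean, (mean - z_crit * se, mean + z_crit * se)


# Hypothetical effect sizes, for illustration only.
effects = [-1.0, 0.0, 1.0]
variances = [0.25, 0.25, 0.25]
fixed_mean, fixed_ci = random_effects_summary(effects, variances, tau2=0.0)
random_mean, random_ci = random_effects_summary(effects, variances, tau2=0.75)
```

Adding the variance component leaves the mean unchanged in this example but makes the interval wider, which is the mechanism by which intervals that were separate under the fixed effects model can come to overlap.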



Conclusion

Before drawing further conclusions from the results of these analyses, the limitations of the study
must be identified. Most notably, this meta-analysis was conducted on a small number of studies. Using
so small a sample is problematic because it limits the stability and generalizability of the findings.
Specifically, the variance component for the random effects model is assumed to be known although it is
estimated from the data. With such a small sample, this estimate is subject to a high degree of error. One
other limitation of the meta-analysis is the presence of “study” effects. The ninety comparisons came from
only ten studies, so many of the effect sizes were correlated, distorting the results and conclusions.
Finally, the statistical power of these analyses was low. To alleviate these problems, more data must be
identified and collected.

The strength of what can be concluded from these analyses depends on the conception of the
problem. If one believes that the data presented for analysis are “true” effect sizes, then the conclusion that
there is some difference in the cut scores produced by different standard setting methods should be
maintained. If this approach is endorsed, more data must be gathered to allow for further multi-factor
analyses to control for the heterogeneity of variance encountered in this study. In fact, the field already
recognizes this in some respects. Many studies have attempted to explain the heterogeneity in cut scores
by introducing modifications to the established standard setting procedures. Approaches in the literature
have included changing the number of judges involved in the standard setting process, providing the
judges with normative feedback, and allowing discussion amongst the judges (Brennan & Lockwood,
1980; Halpin & Halpin, 1987; Koffler, 1980). A future meta-analysis should be conducted to investigate
the effects of these modifications.

In the absence of a way to control for these identified sources of variance, other theoretical
approaches are justified. In another conceptualization of the research question, the data are seen as
randomly varying effect size estimates. In this approach, effect size variation is inflated due to the addition
of a randomness component. The effect size estimates in this model are seen as having been drawn
randomly from a “universe” of effect size estimates rather than the “true” values of these differences.

Depending on the conceptualization chosen, the results of this study may be interpreted
differently. For a fixed effects model approach, some standard setting approaches produce significantly
different cut scores. This interpretation, however, must be tempered by the amount of heterogeneity in the
model. If a random effects model is endorsed, it is recognized that no significant differences between
methods are produced. Again, this conclusion must be taken with caution because of the relationship
between the within method variation and the between method variation. Moving away from these
extremes in interpretation, the most important conclusion to be drawn is that the variability within standard
setting method is at least as large as any difference between standard setting methods. In other words, the
variability within identifiable and recognized standard setting procedures is too great to be able to make
definitive statements about the relative differential effect between standard setting methods.

Although the final analysis is unable to make conclusive statements about systematic differences 
in effect sizes and cut scores produced by the standard setting methods presented, meta-analysis holds 
much promise in its ability to answer these questions in the future. Systematically investigating these 
different approaches and the cut scores they produce would benefit testing and certification organizations, 
providing empirical evidence and justification of the use of a particular standard setting method.

Table 2. Statistical Results Tables

Table 4. Number of Comparisons by Study, Size, and Procedure